Abstract:

The question of whether proteins originate from random sequences of amino acids is addressed. 
A statistical analysis is performed in terms of blocked and random walk values formed by binary 
hydrophobic assignments of the amino acids along the protein chains. Theoretical expectations of 
these variables from random distributions of hydrophobicities are compared with those obtained 
from functional proteins. The results, which are based upon proteins in the SWISS-PROT data 
base, convincingly show that the amino acid sequences in proteins differ from what is expected 
from random sequences in a statistically significant way. By performing Fourier transforms on the 
random walks one obtains additional evidence for non-randomness of the distributions. 



We have also analyzed results from a synthetic model containing only two amino-acid types, hydrophobic
and hydrophilic. With reasonable criteria on good folding properties in terms of thermodynamical
and kinetic behavior, sequences that fold well are isolated. Performing the same
statistical analysis on the sequences that fold well indicates similar deviations from randomness as 
for the functional proteins. The deviations from randomness can be interpreted as originating from 
anticorrelations in terms of an Ising spin model for the hydrophobicities. 

Our results, which differ from some previous investigations using other methods, may bear on the
question of how permissive the protein folding process is with respect to sequence specificity: only
sequences with non-random hydrophobicity distributions fold well. Other distributions give rise to energy
landscapes with poor folding properties and hence did not survive evolution.
1 Introduction



Hydrophobicity is widely believed to play a central role in the formation of 3D protein structures.
Understanding the statistical distribution of hydrophobicity along proteins is therefore of great
interest. This question has been addressed previously. In Ref. [?] the authors used binary hydrophobicity
assignments, zero or one, and studied simultaneously the distribution of clumps of both zeros
and ones by using the so-called run test. For the majority of the proteins examined it was found that
the results could not be distinguished from those corresponding to completely random sequences.
The same type of statistical test has also been applied to sequences stemming from a simplified
protein model [?]. Here randomly selected sequences were compared with sequences that had been
specially designed to have good folding properties. The statistical analysis did not reveal any difference
between these two groups. These findings seem to indicate that the folding requirements
on proteins are fairly permissive, with little sequence specificity. A slightly different approach to
the same problem was pursued in Ref. [?], where the binary chains were mapped onto the
trajectories of a random walk and deviations from random distributions were reported.

Also, recent work on simplified models suggests non-randomness [?, ?]. In these studies a large
number of randomly selected sequences were investigated, and it was found that only a small fraction
of them folded easily into a thermodynamically stable state.



In this work we study the statistical distribution of hydrophobicity by using methods different from
the run test in Ref. [?]. Along the same lines as in Ref. [?], rather than analyzing raw sequences
of hydrophobicity, we focus on the corresponding random walk representation. In this way the
analysis is more sensitive to long range correlations along the sequence. Our analysis has been
carried out using two different methods, which differ substantially from what is used in Ref. [?]
although the starting point is similar. First, we form block variables, and study how the behavior
of these depends on the block size. When applied to the SWISS-PROT data base [?] of functional
proteins, this method yields clear evidence for non-randomness. In addition, we have performed
a Fourier analysis based on the random walk representation. In this analysis we find non-random
behavior at the wave-length corresponding to $\alpha$-helix structure, as one might have expected, but
also at large wave-lengths.

In our analysis we have divided the sequences into groups corresponding to different fractions of 
hydrophobic residues. This division is important because the results for different groups deviate 
in different directions from those for random sequences. For sequences with a typical fraction 
of hydrophobic residues we find that the non-randomness can be interpreted as anticorrelations. 
This interpretation emerges from a simple Ising model of antiferromagnetic interactions among the
residues. 



Given the impact our results might have on the issue of how permissive the protein folding process
is with respect to sequence specificity, we have carried out the same analysis for a toy model [?, ?],
for which unbiased samples of folding and non-folding sequences can be obtained. This model,
hereafter denoted the AB model, consists of chains of two kinds of "amino acids" interacting with
Lennard-Jones potentials. We have examined the behavior of 300 randomly selected chains of
length 20 in this model [?]. Out of these only about 10% were found to have reasonable folding
properties. Analyzing these sequences with the same methods as those used for the functional
proteins, we obtain results that are qualitatively very similar to those for proteins with a typical
fraction of hydrophobic residues. In particular, we again find deviations from random behavior that
correspond to anticorrelations. One should keep in mind that the toy model chains are quite short
and highly simplified compared to functional proteins. Nevertheless, it is appealing to attempt
to explain the observed similarity in behavior as originating from the fact that those amino
acid sequences exhibiting this type of hydrophobicity distribution are the ones that fold well. Other
distributions give rise to energy landscapes with poor folding properties and hence did not survive
evolution.

All our analysis concerns comparisons between distributions. The ultimate challenge is to decide 
whether a given sequence is non-random or not. This issue, which is beyond the scope of the paper, 
may be feasible when combining different cuts on the measures developed here. 

This paper is organized as follows. In Section 2 we develop our two methods for analyzing binary 
hydrophobicity sequences. In Sections 3 and 4 these methods are applied to real and toy model
proteins respectively. Section 4 also contains the interpretation of deviations from randomness in 
terms of anticorrelations. Finally, a brief summary and outlook can be found in Section 5. 



2 Methods 



In this section we describe the variables and statistical methods employed. Two different, but not 
completely unrelated approaches are used - the blocking and Fourier transform methods. These 
will be described in some detail below. 

Throughout the paper we consider sequences of $N$ residues and denote by $\sigma_i$ the hydrophobicity of
residue $i$. We use a binary hydrophobicity scale: $\sigma_i = 1$ if residue $i$ is hydrophobic and $\sigma_i = -1$
otherwise. The analysis can easily be extended to an arbitrary number of allowed hydrophobicity
values and we do not expect our results to be affected by using such multivalued hydrophobicity
assignments.

Hydrophobicities $\sigma_i$ represent local properties of a chain. As in Ref. [?], in order to capture some
long range correlation properties we consider a random walk representation

$$r_n = \sum_{i=1}^{n} \sigma_i \qquad (1)$$

for $n = 1, \ldots, N$, where $r_0 = 0$.
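The binary assignment and the random walk of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example sequence are made up, and the hydrophobic set is the six-residue choice quoted in Section 3 (Leu, Ile, Val, Phe, Met, Trp), written in one-letter codes.

```python
from itertools import accumulate

# Hydrophobic residues in one-letter codes: Leu, Ile, Val, Phe, Met, Trp.
# Any other binary scale can be substituted here.
HYDROPHOBIC = set("LIVFMW")


def hydrophobicity(sequence):
    """Map a one-letter amino-acid string to sigma_i = +1 (hydrophobic) or -1."""
    return [1 if aa in HYDROPHOBIC else -1 for aa in sequence.upper()]


def random_walk(sigma):
    """Return [r_0, r_1, ..., r_N] with r_0 = 0 and r_n = sigma_1 + ... + sigma_n."""
    return [0] + list(accumulate(sigma))


sigma = hydrophobicity("MKVLAAGIW")  # made-up example sequence
walk = random_walk(sigma)
```

The end point of the walk equals $2N_+ - N$ and therefore depends only on the composition, not on the order of the residues.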



2.1 The Blocking Method 

Analyzing the behavior of block variables is a widely used and fruitful technique in statistical
mechanics, and our application will turn out to be no exception. For a block size $s$ we define the
variables

$$\sigma_i^{(s)} = \sum_{j=1}^{s} \sigma_{(i-1)s+j} = r_{is} - r_{(i-1)s}, \qquad i = 1, \ldots, N/s \qquad (2)$$

where it is assumed that $N$ is a multiple of $s$. The scaling behavior of $\sigma_i^{(s)}$ with increasing $s$ is
determined by the correlations between $\sigma_i$ and $\sigma_j$. If the $\sigma_i$'s are independent random numbers
drawn from the same distribution, the correlations between different $\sigma_i$ and $\sigma_j$ vanish and the
variance of $\sigma_i^{(s)}$ scales linearly with $s$.

We need to be able to compare real proteins with a random distribution of hydrophobic residues. For
this reason we average over all sequences with a fixed length and composition. These averages are
denoted by $\langle \cdot \rangle_{N,N_+}$, where $N$ is the total number of residues and $N_+$ is the number of hydrophobic
residues.

In order to study the fluctuations of the block variables we introduce the normalized variables

$$\psi_i^{(s)} = \frac{1}{K}\left(\sigma_i^{(s)} - \langle \sigma_i^{(s)} \rangle_{N,N_+}\right)^2, \qquad i = 1, \ldots, N/s \qquad (3)$$

where $\langle \sigma_i^{(s)} \rangle_{N,N_+} = s(2N_+ - N)/N$ and

$$K = \frac{4N_+(N - N_+)}{N^2}\,\frac{N - s}{N - 1}. \qquad (4)$$

The constant $K$ is chosen such that $\langle \psi_i^{(s)} \rangle_{N,N_+} = s$ for all $N$ and $N_+$. The fact that $K$ depends
on $s$ implies that the variance of $\sigma_i^{(s)}$ is not linear in $s$, which is due to the fact that the average is
taken at fixed composition. At fixed $s$ this deviation from linearity disappears in the limit $N \to \infty$.
If all the residues are of the same type, $K$ vanishes. Such sequences are uninteresting in the present
analysis and have therefore been excluded.
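A minimal sketch of the normalized block variables may help here. It assumes the normalization $K = 4N_+(N-N_+)(N-s)/[N^2(N-1)]$, which is the value that makes the fixed-composition average of each normalized block variable equal to $s$; the function names are illustrative and not taken from the paper.

```python
def block_sums(sigma, s):
    """Block variables: sums over consecutive, non-overlapping blocks of size s."""
    if len(sigma) % s != 0:
        raise ValueError("sequence length must be a multiple of the block size")
    return [sum(sigma[i:i + s]) for i in range(0, len(sigma), s)]


def normalization(n, n_plus, s):
    """K = 4 N+ (N - N+) / N^2 * (N - s) / (N - 1) (assumed form, see lead-in)."""
    return 4.0 * n_plus * (n - n_plus) / n ** 2 * (n - s) / (n - 1)


def normalized_block_variables(sigma, s):
    """Return the list of (block sum - its fixed-composition mean)^2 / K."""
    n = len(sigma)
    n_plus = sum(1 for x in sigma if x == 1)
    k = normalization(n, n_plus, s)
    if k == 0:
        # Happens when all residues are of the same type (or when s == N).
        raise ValueError("K vanishes; the sequence carries no information")
    mean_block = s * (2 * n_plus - n) / n
    return [(b - mean_block) ** 2 / k for b in block_sums(sigma, s)]


example = [1, -1, 1, 1, -1, -1, -1, 1, 1, -1, 1, -1]  # made-up sequence, N = 12, N+ = 6
psi_blocks = normalized_block_variables(example, 2)
```

Raising an error when $K$ vanishes mirrors the exclusion of single-type sequences described in the text.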

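An important quantity is the (normalized) mean-square fluctuation of the block variables, defined
by

$$\psi^{(s)} = \frac{s}{N} \sum_{i=1}^{N/s} \psi_i^{(s)} = \frac{1}{K}\left[\frac{s}{N} \sum_{i=1}^{N/s} \left(\sigma_i^{(s)}\right)^2 - \left(\frac{s}{N} \sum_{i=1}^{N/s} \sigma_i^{(s)}\right)^2\right]. \qquad (5)$$

Obviously, one has

$$\langle \psi^{(s)} \rangle_{N,N_+} = s \qquad (6)$$

and $\psi^{(1)} = 1$ independent of the configuration $\sigma_i$. It is also important to know the variance
of $\psi^{(s)}$. The complete expression for this quantity is lengthy and can be found in Appendix A.
However, in the $N \to \infty$ limit it takes the simple form

$$\langle (\psi^{(s)})^2 \rangle_{N,N_+} - \langle \psi^{(s)} \rangle_{N,N_+}^2 \to \frac{2s^2(s-1)}{N}. \qquad (7)$$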



When studying proteins from the data base, we average over sequences with different length and
composition. For a general quantity this requires some assumption about the probability of different
values of $N$ and $N_+$ in order to compare with random sequences. This problem is absent for $\psi^{(s)}$,
since it is defined such that $\langle \psi^{(s)} \rangle_{N,N_+}$ is independent of $N$ and $N_+$. The variance of $\psi^{(s)}$, on the
other hand, does depend upon $N$ and $N_+$. However, for an interval $N_1 < N < N_2$ with both $N_1$ and
$N_2$ large and $(N_2 - N_1)$ not too large, it is still possible to use Eq. (7) for estimating the variance.
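The two exact properties of the mean-square fluctuation stated in this subsection (its fixed-composition average equals $s$, and its value for $s = 1$ is 1 for every sequence) can be checked by brute force for short chains. The sketch below enumerates every placement of the hydrophobic residues; it again assumes the normalization $K = 4N_+(N-N_+)(N-s)/[N^2(N-1)]$, and the helper names are illustrative.

```python
from itertools import combinations


def mean_square_fluctuation(sigma, s):
    """Normalized mean-square block fluctuation: (s/N) times the sum of the block terms."""
    n = len(sigma)
    n_plus = sum(1 for x in sigma if x == 1)
    k = 4.0 * n_plus * (n - n_plus) / n ** 2 * (n - s) / (n - 1)  # assumed form of K
    mean_block = s * (2 * n_plus - n) / n
    blocks = [sum(sigma[i:i + s]) for i in range(0, n, s)]
    return (s / n) * sum((b - mean_block) ** 2 for b in blocks) / k


def fixed_composition_average(n, n_plus, s):
    """Average over every sequence with n residues, n_plus of them hydrophobic."""
    total, count = 0.0, 0
    for positions in combinations(range(n), n_plus):
        sigma = [-1] * n
        for p in positions:
            sigma[p] = 1
        total += mean_square_fluctuation(sigma, s)
        count += 1
    return total / count


average_s3 = fixed_composition_average(12, 4, 3)  # 495 sequences; expected to equal s = 3
```

Exhaustive enumeration is only practical for short chains; for database-sized sequences one would sample random permutations instead.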
2.2 The Fourier Transform Method



The most direct way to detect periodicity in the distribution of hydrophobic residues is to use
Fourier analysis. It is well-known that the Fourier component corresponding to a period of 3.6
residues tends to be strong for sequences that form $\alpha$ helices [?]. Also, sequences that form $\beta$
sheets tend to exhibit a periodicity in the hydrophobicity of about 2.3 residues. In this paper we
compare the full power spectrum for proteins with that for random sequences.

As a starting point for our Fourier analysis we take the random walk representation $r_n$. Since we
want to compare with random sequences and since any permutation of the residues leaves the end
point $r_N$ unchanged, it is here convenient to introduce the modified random walk (see Appendix B)

$$\rho_0 = r_0 = 0 \qquad (8)$$

$$\rho_n = \sum_{i=1}^{n} \left(\sigma_i - \frac{2N_+ - N}{N}\right) = r_n - n\,\frac{2N_+ - N}{N}, \qquad n = 1, \ldots, N \qquad (9)$$

which is defined such that $\rho_0 = \rho_N = 0$. With these boundary conditions, we consider the sine
transform

$$f_k = \sum_{n=1}^{N-1} \rho_n \sin\frac{\pi k n}{N}, \qquad k = 1, \ldots, N-1 \qquad (10)$$

where the $k$th component corresponds to a wave-length of $2N/k$ residues.
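As a worked example of this correspondence, the two periodicities quoted at the start of this subsection (3.6 residues for helices, about 2.3 residues for sheets) translate into reduced frequencies $k/N$ by simple arithmetic. The symbol $\lambda$ for the wave-length is introduced here only for illustration.

```latex
\lambda = \frac{2N}{k}
\;\Longrightarrow\;
\frac{k}{N} = \frac{2}{\lambda},
\qquad
\lambda = 3.6 \;\Rightarrow\; \frac{k}{N} \approx 0.56,
\qquad
\lambda = 2.3 \;\Rightarrow\; \frac{k}{N} \approx 0.87 .
```

These are the positions on a $k/N$ axis at which helix-like and sheet-like periodicity would be expected to show up.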

It is easy to see that the average of $f_k$ over all sequences with a fixed length and composition
vanishes, and for the squared amplitude we find

$$\langle f_k^2 \rangle_{N,N_+} = \frac{2N_+(N - N_+)}{N - 1}\,\frac{1}{\left(2\sin\frac{\pi k}{2N}\right)^2} \qquad (11)$$

which shows that this quantity behaves as $k^{-2}$ for small $k$. In Appendix B we also give the fourth
moment of the $f_k$ distribution.

In our calculations we have used the normalized squared amplitude

$$\tilde f_k^2 = \frac{f_k^2}{\langle f_k^2 \rangle_{N,N_+}} \qquad (12)$$

which has an average $\langle \tilde f_k^2 \rangle_{N,N_+} = 1$ independent of $N$ and $N_+$. By measuring $\tilde f_k^2$ one can, of course,
only study the relative strength of the different components. In fact, it can easily be shown that

$$\sum_{k=1}^{N-1} \tilde f_k^2 = N - 1 \qquad (13)$$

independent of the configuration $\sigma_i$.
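The Fourier method of this subsection can be sketched as follows. The code assumes an unnormalized sine transform of the modified walk and a random-sequence expectation of $2N_+(N-N_+)/(N-1)$ divided by $(2\sin(\pi k/2N))^2$ for the squared amplitude, which is the form consistent with Appendix B; function names and the example sequence are illustrative.

```python
import math


def modified_walk(sigma):
    """Walk with the mean step removed, so that it starts and ends at zero."""
    n = len(sigma)
    mean = sum(sigma) / n  # equals (2 N+ - N) / N
    rho, running = [0.0], 0.0
    for x in sigma:
        running += x - mean
        rho.append(running)
    return rho


def normalized_power_spectrum(sigma):
    """Squared sine-transform amplitudes divided by their random-sequence average, k = 1..N-1."""
    n = len(sigma)
    n_plus = sum(1 for x in sigma if x == 1)
    rho = modified_walk(sigma)
    spectrum = []
    for k in range(1, n):
        f_k = sum(rho[j] * math.sin(math.pi * k * j / n) for j in range(1, n))
        # Assumed random-sequence expectation of f_k^2 at fixed composition.
        expected = (2.0 * n_plus * (n - n_plus) / (n - 1)
                    / (2.0 * math.sin(math.pi * k / (2 * n))) ** 2)
        spectrum.append(f_k ** 2 / expected)
    return spectrum


example = [1, -1, -1, 1, -1, -1, -1, 1, -1, -1]  # made-up sequence, N = 10, N+ = 3
spectrum = normalized_power_spectrum(example)
```

The direct double loop costs $O(N^2)$ operations per sequence, which is adequate for a sketch; a fast sine transform would be the natural replacement for large data sets.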



3 Real Proteins 



Our analysis has been carried out using the SWISS-PROT data base, release 31 [?]. Some proteins
were removed from this data base due to uncertainties (see Appendix C for details). Also, we
limit the analysis to proteins with $N > 50$ after the endpoints have been removed according to a
prescription to be dealt with below. To each residue we assigned a binary hydrophobicity value,
which was taken to be +1 for Leu, Ile, Val, Phe, Met, and Trp and $-1$ for the others. This choice
was made by picking the residues with strongest hydrophobic interactions down to a preset level.
Alternative definitions, with 4 to 11 of the residues classified as hydrophobic, have also been tested,
with qualitatively similar results.

Estimates of statistical errors on the measurements have been obtained by dividing the data into 20
groups and treating the corresponding averages as independent measurements. All statistical errors
quoted are in units of one standard deviation, $\sigma$.
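The error estimate described above can be sketched as below. How the data were split into the 20 groups is not stated, so the interleaved split used here is an assumption, as are the function name and the use of the unweighted mean of the group averages.

```python
import math


def grouped_error(values, n_groups=20):
    """Return (mean, one-sigma error) from the scatter of n_groups group averages.

    Assumption: measurement i is assigned to group i mod n_groups.
    """
    if len(values) < n_groups:
        raise ValueError("need at least one measurement per group")
    groups = [values[g::n_groups] for g in range(n_groups)]
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(means) / n_groups
    # Sample variance of the group averages, treated as independent measurements.
    variance = sum((m - grand_mean) ** 2 for m in means) / (n_groups - 1)
    return grand_mean, math.sqrt(variance / n_groups)


mean, error = grouped_error([float(v) for v in range(40)])  # toy data
```

Treating group averages as independent is only as good as the independence of the underlying sequences, which is why homologies are examined separately in Section 3.4.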

Before starting our final analysis of hydrophobicity correlations, we need to deal with two important 
observations. 

• The data originating from the ends of the sequences display a different behavior than the data 
from the rest of the sequences. 

• Sequences with different fractions of hydrophobic residues tend to behave in different ways. 

As a result, important effects can easily be missed out if averages are computed over the full data 
set, as will be seen below. 



3.1 The Interior versus the Ends 

We begin by examining whether the behavior of the block variable depends upon the position of
the block along the sequence. This analysis is carried out for block sizes $s = 2, 3, 4, 6$, and 12. In
order to obtain a sequence that can be divided into blocks for each of these sizes, we disregard up
to eleven residues at the ends. In this way we form a sequence of length $N' = 12n$, where $n$ is the
largest integer such that $12n \le N$, $N$ being the length of the original sequence.
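The trimming step can be sketched as follows. The text does not say how the discarded residues are distributed between the two ends, so splitting them as evenly as possible is an assumption made purely for illustration.

```python
def trim_to_multiple(sigma, base=12):
    """Drop up to base - 1 residues so that the length is the largest multiple of base."""
    n_keep = (len(sigma) // base) * base
    excess = len(sigma) - n_keep
    left = excess // 2  # assumption: split the excess between the two ends
    return sigma[left:left + n_keep]


trimmed = trim_to_multiple(list(range(59)))  # 59 residues -> 48 kept, 11 dropped
```

A length that is a multiple of 12 can be divided into blocks of each of the sizes 2, 3, 4, 6, and 12 without remainder.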

We study the block fluctuation $\psi_i^{(s)}$ as a function of the relative position $\xi$ of the block center, $\xi$
being 0 at the N-end and 1 at the C-end. The interval in $\xi$ from 0 to 1 is divided into 50 bins and
average values were computed for each of these bins, using all sequences in the data base with more
than 50 residues. In Fig. 1 we show the result for block size $s = 4$. The results for other values of
$s$ are similar. The horizontal line in the figure represents random sequences; if the distribution of
hydrophobic residues were random, the average of $\psi_i^{(s)}$ would be $s$, independent of $i$.

From Fig. 1 it is clear that the block fluctuations are roughly constant over a wide range in $\xi$.
However, it is also evident that the fluctuations tend to increase in strength at the ends, in particular
at the N-end. One also notices that the deviations from the random value tend to cancel if one
averages over all positions.

This shows that it is important to distinguish between the ends and the interior of the sequences
when studying hydrophobicity correlations. In what follows we focus on the interior by ignoring
15% of the residues at each of the two ends, and analyze sequences containing the remaining 70%
of the residues.

Figure 1: Average values of $\psi_i^{(s)}$ against the relative positions of the blocks along the sequence, $\xi$.

4 This is the choice of Ref.



3.2 The Fraction of Hydrophobic Residues 



Our main focus in this paper is on the distribution of hydrophobic residues along the sequence,
and to what extent this distribution has random characteristics. One may also ask whether the
total number of hydrophobic residues in a sequence follows a random pattern. This question can be
addressed by studying the quantity

$$X = \frac{N_+ - Np}{\sqrt{Np(1-p)}} \qquad (14)$$

where $N_+$ is the number of hydrophobic residues, $N$ is the total number of residues, and $p$ is
the average of $N_+/N$ over all sequences. If $N$ hydrophobicity values are drawn randomly and
independently with probability $p$ for the value 1 and $(1 - p)$ for $-1$, the distribution of $X$ becomes
approximately Gaussian with zero mean and unit variance for large $N$.
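Eq. (14) is simple enough to state as code. The sketch below computes $X$ for a single sequence given the database-wide average fraction $p$; the function name is illustrative, and the values in the example are chosen for easy arithmetic rather than taken from the data.

```python
import math


def composition_score(n_plus, n, p):
    """X = (N+ - N p) / sqrt(N p (1 - p)) for one sequence."""
    return (n_plus - n * p) / math.sqrt(n * p * (1.0 - p))


x_example = composition_score(60, 100, 0.5)  # (60 - 50) / sqrt(25) = 2.0
```

A sequence whose hydrophobic fraction equals $p$ has $X = 0$; positive $X$ means more hydrophobic residues than average.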

We have calculated $X$ for the sequences in the data base, after eliminating 30% of the residues,
as discussed above. The average fraction of hydrophobic residues was found to be $p \approx 0.291$. The
distribution of $X$ obtained is shown in Fig. 2, from which we see that the tails are larger than for
the random distribution.



When studying correlations in hydrophobicity, we have divided the sequences into groups corre- 
sponding to different regions in X. This division need not have a simple interpretation in terms of 
standard groups of proteins, but turns out to be useful. Indeed, we will find below that sequences 
with different X tend to display different types of correlations. 



Figure 2: The distribution of $X$ for the sequences in the data base. The curve is the Gaussian with
zero mean and unit variance.



3.3 Results 



We now turn to the results of our block and Fourier analysis. As discussed in the previous two 
subsections, we have chosen to consider the interior of the sequences and to study different regions 
in X. 

First we consider the mean-square fluctuation of the block variables, $\psi^{(s)}$. In Fig. 3 results are
shown corresponding to three different regions in $X$: $|X| < 0.5$, $|X| > 3$, and all $X$. The straight
line represents random sequences. We see that the results for large $|X|$ lie above this line, while
the results for small $|X|$ show the opposite behavior. The same pattern is observed when using
alternative hydrophobicity assignments. Notice that $\psi^{(s)}$ cannot increase slower than linearly with
$s$ if the correlation between $\sigma_i$ and $\sigma_j$ is translationally invariant and non-negative. Therefore, these
results suggest that there exist negative hydrophobicity correlations for small $|X|$.

We have also tested how these results depend on the length of the sequences by computing averages
of $\psi^{(s)}$ corresponding to different intervals in $N$, where $N$ is the length of the sequence prior to
the elimination of residues at the ends. In Fig. 4 we show results obtained for $|X| < 0.5$ and three
different intervals in $N$. It is clear from Fig. 4 that the size dependence is fairly weak. Another
interesting feature is that the deviation from the result for random sequences grows with sequence
length. Notice that the standard deviation of $\psi^{(s)}$ scales as $N^{-1/2}$ for random sequences.



Next we compare the behavior of the Fourier components for small and large $|X|$. In Fig. 5 we have
plotted the normalized squared amplitude against $k/N$ for $|X| < 0.5$ and $|X| > 3$. Let us first
consider the region of small and medium wave-length. Here the results for the two intervals in $|X|$
are similar. As one might have expected, there is a peak around the wave-length corresponding to
$\alpha$-helix structure, $2N/k = 3.6$. Away from this peak, the results are very close to those for random
sequences.

At large wave-length, on the other hand, the results show a clear $|X|$ dependence, and they differ
from the results for random sequences for both small and large $|X|$. As can be seen from the figure,
these components are suppressed for small $|X|$ and strong for large $|X|$.

Figure 3: Mean-square fluctuation of the block variables, $\psi^{(s)}$, as a function of block size $s$ for
$|X| < 0.5$ (+; 10154 qualifying proteins), $|X| > 3$ (x; 4928), and all $X$ (o; 36765). The averages
have been computed over sequences which contained more than 50 residues prior to the elimination
of residues at the ends. The straight line is the result for random sequences.

Figure 4: Mean-square fluctuation of the block variables, $\psi^{(s)}$, against block size $s$ for $|X| < 0.5$
and $50 < N < 150$ (+; 2457 qualifying proteins), $150 < N < 250$ (x; 2228), and $250 < N < 350$ (o;
1642). The straight line is the result for random sequences.



3.4 Tests on Non-Redundant Sets 

A general problem in the statistical analysis of proteins is the presence of homologies, since these
may shift distributions away from those of an ideal set of independent samples.

In order to test for effects due to homologies, we redid the analysis above using a set of 486 selected
sequences [?] from the Protein Data Bank [?]. This set was obtained by allowing for a maximum
of 25% sequence similarity for aligned subsequences of more than 80 residues [?]. Within this set
of minimally redundant sequences, 185 with $|X| < 0.5$ and 5 with $|X| > 3.0$ qualified for analysis.
The results for $|X| < 0.5$ are within statistics identical to those described above. For $|X| > 3.0$ the
results are not in conflict with the results above but quantitative comparisons are not meaningful
due to the extremely small sample size.

Figure 5: Normalized squared amplitude against $k/N$ for a) $|X| < 0.5$ and b) $|X| > 3$. The sets
of sequences considered are the same as in Fig. 3.

The fact that our results survive when limiting ourselves to non-redundant proteins, implying a 
substantial cut in number of proteins involved in the analysis, makes the evidence of non-randomness 
even stronger. 



4 A Simplified Synthetic Model 

In this section we carry through the same hydrophobicity analysis as above for a simple toy model
for proteins [?, ?] with binary amino acids, the AB model. Due to its simplicity and the relatively
small sizes involved, the folding properties of this model have been studied in considerable detail [?, ?].
The question we want to address here is whether the sequences that have good folding properties
in the AB model deviate from the non-folding ones in a way qualitatively similar to what was found
for the small-$|X|$ functional proteins above. As will be shown below, this is indeed the case.



4.1 The AB Model 



In this model, there are two kinds of residues, A and B, with $\sigma_i = \pm 1$ respectively. These are linked
by rigid bonds to form linear chains in two dimensions. The interactions between the residues are
given by $\sigma_i$-dependent Lennard-Jones potentials such that A-A pairs are strongly attractive, B-B pairs
weakly attractive and A-B pairs repulsive. In Ref. [?] the thermodynamics of this system at low temperature
was studied using the hybrid Monte Carlo method. Fluctuations in the shape for a given chain were
studied by measuring the mean-square distance $\delta^2$ between pairs of configurations; the probability
distribution of $\delta^2$, for fixed temperature and sequence, describes the magnitude of the thermodynamically
relevant fluctuations. It is suggestive to interpret a low average $\delta^2$ as a signal for good
folding and stability properties. Recently an attempt to understand the systematics of how low
$\delta^2$ values relate to the $\sigma_i$ sequence was pursued [?]. In this work 300 randomly selected sequences
with 14 A and 6 B residues were studied, using an improved Monte Carlo method [?, ?]. The
sequences were classified as having good folding properties if the average $\delta^2$ was less than 0.3, or if
the probability of $\delta^2 < 0.1$ was greater than 0.35. This yielded a total of 37 good folders (roughly
10%).

Figure 6: Mean-square fluctuation of the block variables, $\psi^{(s)}$, against block size $s$ for good folding
sequences in the AB model. Also shown are the mean $s$ (full line) and the $s \pm \sigma$ band (bounded by
dotted lines) for random sequences.

5 The March 1996 edition was used.



4.2 Results 

Using the 37 good folding sequences, we have repeated the analysis of the previous section. This
set of sequences is fairly small, but has the advantage that it has been generated in a bias-free
way. Statistical errors given in this section have been obtained by taking the results for different
sequences as independent measurements.

In Fig. 6 we show the mean-square fluctuation of the block variables, $\psi^{(s)}$. The average of $\psi^{(s)}$
over 37 random sequences has an approximately Gaussian distribution, with mean $s$ and a standard
deviation $\sigma$ that can be obtained by using the results of Appendix A. In the figure we have indicated
the position of the $s \pm \sigma$ band. We see that the data points lie clearly below this band.

Our results for the squared Fourier amplitude are shown in Fig. 7. Although the statistical errors
on this quantity are large, there are clear deviations from the result for random sequences at large
wave-length. We see that components corresponding to large wave-lengths are suppressed.

These results show that good folding sequences in the AB model tend to exhibit small block fluctuations
and weak Fourier components at large wave-length. Qualitatively, the results are very similar
to those obtained in the previous section for small $|X|$.

Figure 7: Normalized squared amplitude against $k/N$ for good folding sequences in the AB
model. The full line and dots are as in Fig. 6. The standard deviation for random sequences can
be obtained by using the results of Appendix B.



4.3 Interpretation of the Results 

In this paper we have compared various results with those for random sequences. Random sequences
correspond to a situation in which there is (essentially) no correlation between $\sigma_i$ and $\sigma_j$ for $i \ne j$.
A simple but instructive way to introduce non-zero correlations into the system is to consider the
one-dimensional Ising model. In this model there are $N$ "spins" $\sigma_i$ that take the values $\pm 1$, and
each configuration is given a statistical weight

$$\exp\left(K \sum_{i=1}^{N-1} \sigma_i \sigma_{i+1}\right). \qquad (15)$$

As in our previous calculations, we consider configurations with a fixed number of positive spins, i.e.,
the magnetization $M = \sum_{i=1}^{N} \sigma_i = 2N_+ - N$ is held fixed. Also, as before, free boundary conditions
are assumed.

The properties of this system are determined by the parameter $K$. Neighboring spins tend to point
in the same direction if $K > 0$ (ferromagnet), and in opposite directions if $K < 0$ (antiferromagnet).
For $K = 0$ we recover the (random) system studied previously.
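For short chains the Ising averages can be computed exactly by enumeration, which gives a quick way to see the effect of the coupling on the block fluctuations. The sketch below assumes the nearest-neighbor weight $\exp(K \sum_i \sigma_i \sigma_{i+1})$ with free boundaries at fixed composition, and the block normalization $4N_+(N-N_+)(N-s)/[N^2(N-1)]$; all function names are illustrative. The coupling is called `coupling` in the code to avoid confusion with the normalization constant of Section 2.1.

```python
import math
from itertools import combinations


def mean_square_fluctuation(sigma, s):
    """Normalized mean-square block fluctuation for one configuration."""
    n = len(sigma)
    n_plus = sum(1 for x in sigma if x == 1)
    norm = 4.0 * n_plus * (n - n_plus) / n ** 2 * (n - s) / (n - 1)  # assumed form
    mean_block = s * (2 * n_plus - n) / n
    blocks = [sum(sigma[i:i + s]) for i in range(0, n, s)]
    return (s / n) * sum((b - mean_block) ** 2 for b in blocks) / norm


def ising_average(n, n_plus, s, coupling):
    """Exact average over fixed-composition chains with nearest-neighbor Ising weights."""
    partition, weighted_sum = 0.0, 0.0
    for positions in combinations(range(n), n_plus):
        sigma = [-1] * n
        for p in positions:
            sigma[p] = 1
        bonds = sum(sigma[i] * sigma[i + 1] for i in range(n - 1))  # free boundaries
        weight = math.exp(coupling * bonds)
        partition += weight
        weighted_sum += weight * mean_square_fluctuation(sigma, s)
    return weighted_sum / partition


random_value = ising_average(12, 8, 2, 0.0)     # zero coupling: the random case
antiferro = ising_average(12, 8, 2, -0.25)
ferro = ising_average(12, 8, 2, 0.25)
```

The same routine runs for $N = 20$, $N_+ = 14$ (38760 configurations), the case quoted in the text, at the cost of a somewhat longer run time.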

To illustrate the behavior of the system for non-zero $K$, we show in Fig. 8 results obtained at
$K = \pm 0.25$. As in our AB calculations, we have taken $N = 20$ and $N_+ = 14$. At $K = 0.25$
we see that block fluctuations are large and that Fourier components with large wave-length are
strong, while the behavior is the opposite at $K = -0.25$. This means that the results for this
antiferromagnetic system are similar to those obtained for good folding sequences in the AB model
and for protein sequences with small $|X|$. On the other hand, the results for the ferromagnetic
system resemble those for proteins with large $|X|$.

Figure 8: a) Mean-square fluctuation of the block variable, $\psi^{(s)}$, and b) normalized squared amplitude
for the Ising model with $K = -0.25$ (+) and $K = 0.25$ (x). The lines correspond to
$K = 0$.



5 Summary 



We have demonstrated that the statistical distribution of hydrophobic residues along chains of
functional proteins is non-random. This result is in contrast with what was concluded in Ref. [?].
An important reason for this difference is probably that the blocking and Fourier analysis methods
are able to capture long range correlations more efficiently than the method of Ref. [?]. In Ref. [?],
on the other hand, a method more similar to ours was used and deviations from random behavior
were observed, but the deviations may seem to differ in nature from what we have found. However,
it is important to note that these authors focused on hydrophilicity rather than hydrophobicity,
as they used a binary classification in which five strongly hydrophilic residues formed one group.
Also, the interpretation of the results of Ref. [?] is somewhat unclear, as no distinction was made
between the interior and the ends of the sequences. When limiting the data set to non-redundant
protein chains the results from the analysis are unaffected. Hence we consider our evidence for
non-randomness as being quite robust.

We have also applied our analysis method to a toy model data base (AB model) where chains with
good folding properties were distinguished from the rest. The hydrophobicity distributions of the
good folding sequences differ from random ones in qualitatively the same way as for the low-$|X|$
functional proteins. It is tempting to interpret this similarity as indicating that only those
proteins with good folding properties have survived evolution.

The deviation from randomness in the AB model case can be understood as originating from anticorrelations
among the residues. The effects of correlations and anticorrelations on the observables
considered were illustrated by using the simple one-dimensional Ising model.

Our analysis has been a statistical one in the sense that distributions are being compared. Given our
encouraging result, it might be possible to reach the ultimate goal of being able to classify individual
sequences in terms of belonging to one category or the other. This might be feasible by considering
suitable cuts in the block and Fourier quantities. Very likely, one then needs to augment the method
with additional discriminative variables and an automated procedure like artificial neural networks
for setting the cuts.

Our analysis has been confined to binary hydrophobicity assignments. The results presented are 
insensitive to minor modifications of these assignments. We do not expect the results to change 
significantly if instead of binary assignments multivalued ones are used. 
Appendix A.

Variance of $\psi^{(s)}$



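In this Appendix we give the variance of $\psi^{(s)}$ (see Eq. (5)). The average of $\psi^{(s)}$ over all sequences
with fixed composition, $N_+$ and $N_- = N - N_+$, is given by

$$\langle \psi^{(s)} \rangle_{N,N_+} = \frac{1}{K} \sum_{i=1}^{s} \sum_{j=1}^{s} c_{ij} \qquad (A1)$$

where $c_{ij}$ is the connected correlation between $\sigma_i$ and $\sigma_j$, for which one finds

$$c_{ij} = \langle \sigma_i \sigma_j \rangle_{N,N_+} - \langle \sigma_i \rangle_{N,N_+} \langle \sigma_j \rangle_{N,N_+} =
\begin{cases} \dfrac{4N_+N_-}{N^2} & i = j \\[2mm] -\dfrac{4N_+N_-}{N^2(N-1)} & i \ne j \end{cases} \qquad (A2)$$

Using this, one obtains $\langle \psi^{(s)} \rangle_{N,N_+} = s$ (Eq. (6)). The off-diagonal correlation, with $i \ne j$, has to
be negative since $\sum_{i=1}^{N} \sum_{j=1}^{N} c_{ij} = 0$, but vanishes in the limit $N \to \infty$.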
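The fixed-composition two-point correlation can be verified by exhaustive enumeration for a short chain. The sketch below compares the enumerated value with the closed form $4N_+N_-/N^2$ on the diagonal and $-4N_+N_-/[N^2(N-1)]$ off the diagonal; the function names are illustrative.

```python
from itertools import combinations


def connected_correlation(n, n_plus, i, j):
    """Connected correlation of sites i and j (0-based), averaged over all fixed-composition sequences."""
    count = 0
    sum_i = sum_j = sum_ij = 0
    for positions in combinations(range(n), n_plus):
        chosen = set(positions)
        a = 1 if i in chosen else -1
        b = 1 if j in chosen else -1
        sum_i += a
        sum_j += b
        sum_ij += a * b
        count += 1
    return sum_ij / count - (sum_i / count) * (sum_j / count)


def predicted_correlation(n, n_plus, same_site):
    """Closed form: 4 N+ N- / N^2 on the diagonal, divided by -(N - 1) off it."""
    n_minus = n - n_plus
    diagonal = 4.0 * n_plus * n_minus / n ** 2
    return diagonal if same_site else -diagonal / (n - 1)


off_diagonal = connected_correlation(8, 3, 1, 5)
```

Because every pair of distinct sites is equivalent under permutations, the off-diagonal value does not depend on which two sites are chosen.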

The variance can be computed in a similar way. In addition to $c_{ij}$, one then needs the correlation
between four $\sigma_i$'s. One finds

$$\langle (\psi^{(s)})^2 \rangle_{N,N_+} - \langle \psi^{(s)} \rangle_{N,N_+}^2 = 2s^2(s-1)N^{-1}K^{-2}\,G(s, N, N_+) \qquad (A3)$$

where

G(s,N,N + ) = l + 2(s-2) 1 ± W ^ T) (A4) 

(A + - A_) 4 - 6A(A + - A_) 2 + 3A 2 + 8(A + - A_) 2 - 6 A 
_( S_ ' A(A-l)(A-2)(A-3) 
1 _ / (4A-6)(A + -A_) 4 
2 [ ' \N{N- l) 2 (A-2)(A-3) 

-6A(A+ - 7V_) 2 + 3A 2 + 8(A+ - A_) 2 - 6A 2(A+ - 7V_) 2 - N 



(N — 1)(N — 2)(N — 3) (A-l) 2 
In the limit $N \to \infty$, for fixed $s$, this expression simplifies to Eq. (7).
Appendix B.



Fourier Transforms of Random Walk Representations 



In this Appendix Fourier transform moments of random walk representations are listed. The expressions
are more general than what is required for binary hydrophobicity assignments. In order
to do this we first list the following basic quantities for sequences of length $N$:

• Moment of order $k$, $m_k$.

$$m_k = \frac{1}{N} \sum_{i=1}^{N} \sigma_i^k$$

• Cumulant of order $k$, $c_k$.

$$c_2 = m_2 - m_1^2$$

$$c_4 = m_4 - 4m_3m_1 - 3m_2^2 + 12m_2m_1^2 - 6m_1^4$$

• Random walk, $\rho_n$.

$$\rho_0 = 0$$

$$\rho_n = \sum_{i=1}^{n} \sigma_i - n m_1, \qquad n = 1, \ldots, N \quad (\rho_N = 0)$$

• Sine transform of $\rho_n$, $f_k$.

$$f_k = \sum_{n=1}^{N-1} \rho_n \sin\frac{\pi k n}{N}, \qquad k = 1, \ldots, N-1$$

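Averaging over all sequences with fixed composition, i.e., all permutations of $\sigma_i$, one obtains

$$\langle \rho_n \rangle = 0 \qquad (B1)$$

$$\langle \rho_n^2 \rangle = \frac{N^2 c_2}{N-1} \cdot \frac{n}{N}\left(1 - \frac{n}{N}\right) \qquad (B2)$$

$$\langle f_k \rangle = 0 \qquad (B3)$$

$$\langle f_k^2 \rangle = \frac{N^2 c_2}{2(N-1)} \cdot \frac{1}{\left(2\sin\frac{\pi k}{2N}\right)^2} \qquad (B4)$$

$\langle f_k^4 \rangle$ =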
3N 4 



A(N - l)(JV-2) 

— S 2 k.N 



24(N - 3) 



4- 












(c 4 -i 





(c 4 + 6C2) 



(2sin^)4 



(Bl) 
(B2) 
(B3) 
(B4) 



(B5) 



For the binary scale $\sigma_i = \pm 1$ one has $c_2 = 4N_+(N - N_+)/N^2$ and Eq. (B4) becomes Eq. (11).
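For the binary scale, the second moment of the sine transform can be checked numerically by averaging over every sequence of a short chain. The sketch below assumes the closed form $2N_+(N-N_+)/(N-1)$ divided by $(2\sin(\pi k/2N))^2$; the function names are illustrative.

```python
import math
from itertools import combinations


def enumerated_second_moment(n, n_plus, k):
    """Average of the squared sine-transform amplitude over all sequences with n_plus hydrophobic residues."""
    mean = (2 * n_plus - n) / n
    total, count = 0.0, 0
    for positions in combinations(range(n), n_plus):
        chosen = set(positions)
        rho, f_k = 0.0, 0.0
        for j in range(1, n):
            # Step j of the modified walk uses residue j (index j - 1).
            rho += (1 if (j - 1) in chosen else -1) - mean
            f_k += rho * math.sin(math.pi * k * j / n)
        total += f_k ** 2
        count += 1
    return total / count


def predicted_second_moment(n, n_plus, k):
    """Assumed closed form for the binary scale."""
    return (2.0 * n_plus * (n - n_plus) / (n - 1)
            / (2.0 * math.sin(math.pi * k / (2 * n))) ** 2)


ratio = enumerated_second_moment(8, 3, 2) / predicted_second_moment(8, 3, 2)
```

A ratio of 1 for every $k$ is what makes the normalized squared amplitude average to unity for random sequences.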
Appendix C.



Removal of Uncertain Sequences 

In our analysis we have removed "uncertain sequences" from the SWISS-PROT database by ignoring 
all entries containing the following feature keys in their feature key table: 

• UNSURE - Indicates that there are uncertainties in the sequence. 

• NON_CONS - Indicates that two residues in a sequence are not consecutive and that there
are a number of unsequenced residues in between.

• NON_TER - The residue at an extremity of the sequence is not the terminal residue.

This reduces the size of the SWISS-PROT database from 43470 to 38050 protein entries. 

Furthermore, when analyzing the interior parts of protein sequences, sequences containing the following
letters

• B denoting Aspartic acid or Asparagine.

• Z denoting Glutamine or Glutamic acid.

• X denoting any amino acid.

within the interior are removed.
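The two filters of this Appendix can be sketched as below. The sketch does not parse SWISS-PROT files: it assumes each entry has already been reduced to a one-letter sequence string and a list of feature keys (with the keys written NON_CONS and NON_TER), and the rounding used when cutting 15% from each end is likewise an assumption. All names are illustrative.

```python
EXCLUDED_FEATURE_KEYS = {"UNSURE", "NON_CONS", "NON_TER"}
AMBIGUOUS_LETTERS = set("BZX")


def is_certain(feature_keys):
    """True if none of the excluded feature keys is present."""
    return not (set(feature_keys) & EXCLUDED_FEATURE_KEYS)


def interior(sequence, fraction=0.15):
    """Drop `fraction` of the residues at each end (rounding rule is an assumption)."""
    cut = int(round(fraction * len(sequence)))
    return sequence[cut:len(sequence) - cut]


def keep_entry(sequence, feature_keys):
    """Apply both filters: certain entry, and no ambiguous letter in the interior."""
    if not is_certain(feature_keys):
        return False
    return not (set(interior(sequence)) & AMBIGUOUS_LETTERS)


kept = keep_entry("XKVLAAGIWAMKVLAAGIWA", ["HELIX"])  # X lies in the discarded end
```

An ambiguous letter in one of the discarded end regions does not disqualify a sequence, since only the interior enters the analysis.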